Intelligent Selection of Language Model Training Data

Authors

Robert C. Moore

William D. Lewis

Abstract

We address the problem of selecting non-domain-specific language model training data to build auxiliary language models for use in tasks such as machine translation. Our approach is based on comparing the cross-entropy, according to domain-specific and non-domain-specific language models, for each sentence of the text source used to produce the latter language model. We show that this produces better language models, trained on less data, than both random data selection and two other previously proposed methods.
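The per-sentence comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the scoring rule, so this sketch assumes the comparison is the difference between the two per-token cross-entropies, with lower scores meaning "more in-domain". The add-one-smoothed unigram models, the function names, and the threshold of 0.0 are illustrative assumptions, not the authors' setup.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Return a function mapping a token to its add-one-smoothed log2 probability.

    A unigram model is an assumption made to keep the sketch short.
    """
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # +1 reserves probability mass for unseen tokens

    def logprob(token):
        return math.log2((counts[token] + 1) / (total + vocab_size))

    return logprob


def cross_entropy(logprob, sentence):
    """Per-token cross-entropy (in bits) of a sentence under a model."""
    tokens = sentence.split()
    if not tokens:
        return 0.0
    return -sum(logprob(t) for t in tokens) / len(tokens)


def select(in_domain, general_pool, threshold=0.0):
    """Keep pool sentences whose cross-entropy difference falls below threshold.

    Score = H_in_domain(s) - H_general(s); the general model is trained on the
    same pool the sentences are drawn from, as the abstract describes.
    The threshold value is hypothetical and would normally be tuned.
    """
    lm_in = train_unigram(in_domain)
    lm_gen = train_unigram(general_pool)
    scored = [
        (cross_entropy(lm_in, s) - cross_entropy(lm_gen, s), s)
        for s in general_pool
    ]
    return [s for score, s in sorted(scored) if score < threshold]


if __name__ == "__main__":
    in_domain = ["the patient received a dose", "the dose was reduced"]
    pool = ["the patient was given a dose", "stock markets fell sharply today"]
    print(select(in_domain, pool))
```

In practice the threshold would be chosen on held-out in-domain data, and stronger language models than unigrams would be used; both choices are left open here.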


Similar resources

Do the Emotionally More Intelligent Gain More from Metacognitive Writing Strategy Training?

Though privileges ascribed to various facets of language learning strategy training have long been espoused with regard to varied language skills and components, the role some individual variables such as emotional intelligence might play in this respect seems to have received very scant attention. The researchers in the current study embarked on a probe into the impact of metacognitive strateg...


MLIFT: Enhancing Multi-label Classifier with Ensemble Feature Selection

Multi-label classification has gained significant attention during recent years, due to the increasing number of modern applications associated with multi-label data. Despite its short life, different approaches have been presented to solve the task of multi-label classification. LIFT is a multi-label classifier which utilizes a new strategy to multi-label learning by leveraging label-specific ...


A Phoneme-Based Student Model for Adaptive Spelling Training

We present a novel phoneme-based student model for spelling training. Our model is data driven, adapts to the user and provides information for, e.g., optimal word selection. We describe spelling errors using a set of features accounting for phonemic, capitalization, typo, and other error categories. We compute the influence of individual features on the error expectation values based on previo...


A hybrid CS-SA intelligent approach to solve uncertain dynamic facility layout problems considering dependency of demands

This paper aims at proposing a quadratic assignment-based mathematical model to deal with the stochastic dynamic facility layout problem. In this problem, product demands are assumed to be dependent normally distributed random variables with known probability density function and covariance that change from period to period at random. To solve the proposed model, a novel hybrid intelligent algo...


Design and Implementation of an Intelligent Part of Speech Generator

The aim of this paper is to report on an attempt to design and implement an intelligent system capable of generating the correct part of speech for a given sentence while the sentence is totally new to the system and not stored in any database available to the system. It follows the same steps a normal individual does to provide the correct parts of speech using a natural language processor. It...


Publication date: 2010
